Find the correlation of the Attrition variable with all other variables in the dataset
A large company named XYZ employs around 4000 employees at any given time. However, every year around 15% of its employees leave the company and need to be replaced from the talent pool available in the job market. The management believes that this level of attrition (employees leaving, either on their own or because they were fired) is bad for the company for the following reasons:
- The former employees' projects get delayed, which makes it difficult to meet timelines and results in a loss of reputation among consumers and partners.
- A sizeable department has to be maintained for the purpose of recruiting new talent.
- More often than not, the new employees have to be trained for the job and/or given time to acclimatise themselves to the company.
Hence, the management has contracted an HR analytics firm to understand what factors they should focus on, in order to curb attrition. In other words, they want to know what changes they should make to their workplace, in order to get most of their employees to stay. Also, they want to know which of these variables is most important and needs to be addressed right away.
Since you are one of the star analysts at the firm, this project has been given to you.
Goal of the case study
You are required to model the probability of attrition. The results thus obtained will be used by the management to understand what changes they should make to their workplace, in order to get most of their employees to stay.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
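# Note: pandas-profiling has been renamed to ydata-profiling; on newer installs use
# `from ydata_profiling import ProfileReport` instead
from pandas_profiling import ProfileReport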
import itertools
import scipy.stats as stats
from scipy.stats import ttest_1samp, ttest_ind, mannwhitneyu, levene, shapiro, wilcoxon
from statsmodels.stats.power import ttest_power
from scipy.stats import linregress
import warnings
warnings.filterwarnings('ignore')
# Show all the columns when displaying the dataframe
pd.set_option('display.max_columns', None)
df = pd.read_csv('general_data.csv')
# Encode the binary categorical columns as integers
map_attrition = {'Yes': 1, 'No': 0}
map_gender = {'Male': 1, 'Female': 0}
df = df.replace({'Attrition': map_attrition, 'Gender': map_gender})
df.head()
# Drop the columns that are not used in the analysis
df = df.drop(columns=['EmployeeID', 'EmployeeCount', 'StandardHours', 'Over18'])
df.info()
pfr = ProfileReport(df, title="Attrition EDA")
pfr.to_file(output_file="Attrition Data Profiling.html")
pfr
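The profiling report covers the whole dataset, but the stated task is the correlation of Attrition with every other variable. One possible way to get it is sketched below. It is a minimal sketch, not necessarily the approach taken later in the notebook. Because `general_data.csv` is not reproduced here, it runs on a small made-up frame: the `toy` data, its column names and values, and the helper `attrition_correlations` are all hypothetical. With Attrition encoded as 0/1, the Pearson coefficient returned by `DataFrame.corr()` is the point-biserial correlation.

```python
import pandas as pd

# Hypothetical toy data standing in for general_data.csv (values are made up)
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    "Age": [25, 40, 38, 29, 45, 50],
    "MonthlyIncome": [30000, 80000, 60000, 35000, 90000, 85000],
    "Department": ["Sales", "R&D", "R&D", "Sales", "HR", "R&D"],
})

# Same encoding as in the notebook: Yes -> 1, No -> 0
toy["Attrition"] = toy["Attrition"].map({"Yes": 1, "No": 0})


def attrition_correlations(frame, target="Attrition"):
    """Pearson correlation of `target` with every other numeric column,
    ordered by absolute strength (hypothetical helper)."""
    # Non-numeric columns (e.g. Department) cannot enter a Pearson correlation
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Reorder so the strongest relationships, positive or negative, come first
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


attrition_corr = attrition_correlations(toy)
print(attrition_corr)
```

On the real data the same call would be `attrition_correlations(df)`. Categorical columns are skipped here and would need to be encoded first if they are to be included.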